Abstract

We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.






1 Introduction 



Mirror Descent is a first-order optimization procedure which generalizes the classic Gradient Descent procedure to non-Euclidean geometries by relying on a "distance generating function" specific to the geometry (the squared $\ell_2$-norm in the case of standard Gradient Descent) [14, 4]. Mirror Descent is also applicable, and has been analyzed, in a stochastic optimization setting [9] and in an online setting, where it can ensure bounded online regret [20]. In fact, many classical online learning algorithms can be viewed as instantiations or variants of Online Mirror Descent, generally either with the Euclidean geometry (e.g. the Perceptron algorithm [5] and Online Gradient Descent [27]), or in the simplex ($\ell_1$ geometry), using an entropic distance generating function (Winnow [13] and the Multiplicative Weights / Online Exponentiated Gradient algorithm [11]). More recently, the Online Mirror Descent framework has been applied, with appropriate distance generating functions derived for a variety of new learning problems like multi-task learning and other matrix learning problems [10], online PCA [26], etc.

In this paper, we show that Online Mirror Descent is, in a sense, universal. That is, for any convex online learning problem of a general form (specified in Section 2), if the problem is online learnable, then it is online learnable, with a nearly optimal regret rate, using Online Mirror Descent with an appropriate distance generating function. Since Mirror Descent is a first-order method and often has simple and computationally efficient update rules, this makes the result especially attractive. Viewing online learning as a sequentially repeated game, this means that Online Mirror Descent is a near-optimal strategy, guaranteeing an outcome very close to the value of the game.

In order to show such universality, we first generalize and refine the standard Mirror Descent analysis to situations where the constraint set is not the dual of the data domain, obtaining a general upper bound on the regret of Online Mirror Descent in terms of the existence of an appropriate uniformly convex distance generating function (Section 3). We then extend the notion of the martingale type of a Banach space to be sensitive to both the constraint set and the data domain, and, building on results of [24], we relate the value of the online learning repeated game to this generalized notion of martingale type (Section 4). Finally, again building on and generalizing the work of [16], we show how having appropriate martingale type guarantees the existence of a good uniformly convex function (Section 5), which in turn establishes the desired nearly-optimal guarantee on Online Mirror Descent (Section 6). We mainly build on the analysis of [24], who related the value of the online game to the notion of martingale type of a Banach space and uniform convexity when the constraint set and data domain are dual to each other. The main technical advance here is a non-trivial generalization of their analysis (as well as the Mirror Descent analysis) to the more general situation where the constraint set and data domain are chosen independently of each other. In Section 7 several examples are provided that demonstrate the use of our analysis.

Mirror Descent was initially introduced as a first-order deterministic optimization procedure and, with an $\ell_p$ constraint and a matching $\ell_q$ Lipschitz assumption ($1 < p < 2$, $1/q + 1/p = 1$), was shown to be optimal in terms of the number of exact gradient evaluations [15]. Shalev-Shwartz and Singer later observed that the online version of Mirror Descent, again with an $\ell_p$ bound and matching $\ell_q$ Lipschitz assumption ($1 < p < 2$, $1/q + 1/p = 1$), is also optimal in terms of the worst-case (adversarial) online regret. In fact, in such scenarios stochastic Mirror Descent is also optimal in terms of the number of samples used. We emphasize that although in most, if not all, settings known to us these three notions of optimality coincide, here we focus only on the worst-case online regret.

Sridharan and Tewari [24] generalized the optimality of online Mirror Descent (w.r.t. the worst-case online regret) to scenarios where the learner is constrained to a unit ball of an arbitrary Banach space (not necessarily an $\ell_p$ space) and the objective functions have sub-gradients that lie in the dual ball of the space; for reasons that will become clear shortly, we refer to this as the data domain. However, often we encounter problems where the constraint set and data domain are not dual balls, but rather are arbitrary convex subsets. In this paper, we explore this more general, "non-dual", variant, and show that also in such scenarios online Mirror Descent is (nearly) optimal in terms of the (asymptotic) worst-case online regret.

2 Online Convex Learning Problem 

An online convex learning problem can be viewed as a multi-round repeated game where on round $t$, the learner first picks a vector (predictor) $w_t$ from some fixed set $\mathcal{W}$, which is a closed convex subset of a vector space $\mathcal{B}$. Next, the adversary picks a convex cost function $f_t : \mathcal{W} \mapsto \mathbb{R}$ from a class of convex functions $\mathcal{F}$. At the end of the round, the learner pays instantaneous cost $f_t(w_t)$. We refer to the strategy used by the learner to pick the $w_t$'s as an online learning algorithm. More formally, an online learning algorithm $\mathcal{A}$ for the problem is specified by the mapping $\mathcal{A} : \bigcup_{n \in \mathbb{N}} \mathcal{F}^{n-1} \mapsto \mathcal{W}$. The regret of the algorithm $\mathcal{A}$ for a given sequence of cost functions $f_1, \ldots, f_n$ is given by

$$R_n(\mathcal{A}, f_1, \ldots, f_n) = \frac{1}{n} \sum_{t=1}^{n} f_t(\mathcal{A}(f_{1:t-1})) - \inf_{w \in \mathcal{W}} \frac{1}{n} \sum_{t=1}^{n} f_t(w) .$$

The goal of the learner (or the online learning algorithm), is to minimize the regret for any n. 
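As a concrete illustration of the regret definition, the following minimal sketch computes the average regret for linear costs $f_t(w) = \langle x_t, w \rangle$. It assumes, for this example only, that $\mathcal{W}$ is the Euclidean unit ball, so that the best fixed comparator has the closed form $-\|\sum_t x_t\|_2 / n$. The function name and the choice of $\mathcal{W}$ are illustrative and do not come from the paper.

```python
import math


def average_regret_linear_l2(xs, ws):
    """Average regret for linear costs f_t(w) = <x_t, w> over the Euclidean unit ball.

    xs: list of cost vectors x_t; ws: list of the learner's predictions w_t.
    """
    n = len(xs)
    dim = len(xs[0])
    # Learner's average cost: (1/n) * sum_t <x_t, w_t>.
    learner = sum(sum(xi * wi for xi, wi in zip(x, w)) for x, w in zip(xs, ws)) / n
    # The best fixed w in the unit ball minimises <sum_t x_t, w>,
    # which gives the value -||sum_t x_t||_2.
    total = [sum(x[i] for x in xs) for i in range(dim)]
    best = -math.sqrt(sum(v * v for v in total)) / n
    return learner - best
```

A learner that always plays the origin against two identical unit cost vectors incurs average regret 1, while always playing the minimiser of the cumulative cost incurs regret 0.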

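In this paper, we consider cost function classes $\mathcal{F}$ specified by a convex subset $\mathcal{X} \subset \mathcal{B}^*$ of the dual space $\mathcal{B}^*$. We consider various types of classes, where for all of them, subgradients^1 of the functions in $\mathcal{F}$ lie inside $\mathcal{X}$ (we use the notation $\langle x, w \rangle$ to mean applying the linear functional $x \in \mathcal{B}^*$ on $w \in \mathcal{B}$):

$$\mathcal{F}_{\mathrm{Lip}}(\mathcal{X}) = \{ f : f \text{ is convex}, \ \forall w \in \mathcal{W}, \ \nabla f(w) \in \mathcal{X} \}, \qquad \mathcal{F}_{\mathrm{lin}}(\mathcal{X}) = \{ w \mapsto \langle x, w \rangle : x \in \mathcal{X} \},$$

$$\mathcal{F}_{\mathrm{sup}}(\mathcal{X}) = \{ w \mapsto |\langle x, w \rangle - y| : x \in \mathcal{X}, \ y \in [-b, b] \} .$$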

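The value of the game is then the best possible worst-case regret guarantee an algorithm can enjoy. Formally the value is defined as:

$$\mathcal{V}_n(\mathcal{F}, \mathcal{X}, \mathcal{W}) = \inf_{\mathcal{A}} \sup_{f_{1:n} \in \mathcal{F}(\mathcal{X})} R_n(\mathcal{A}, f_{1:n}) \qquad (1)$$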

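It is well known that the value of a game for all the above sets $\mathcal{F}$ is the same. More generally:

Proposition 1. If for a convex function class $\mathcal{F}$, we have that $\forall f \in \mathcal{F}, w \in \mathcal{W}$, $\nabla f(w) \in \mathcal{X}$, then

$$\mathcal{V}_n(\mathcal{F}, \mathcal{X}, \mathcal{W}) \le \mathcal{V}_n(\mathcal{F}_{\mathrm{lin}}, \mathcal{X}, \mathcal{W}) .$$

Furthermore, $\mathcal{V}_n(\mathcal{F}_{\mathrm{Lip}}, \mathcal{X}, \mathcal{W}) = \mathcal{V}_n(\mathcal{F}_{\mathrm{sup}}, \mathcal{X}, \mathcal{W}) = \mathcal{V}_n(\mathcal{F}_{\mathrm{lin}}, \mathcal{X}, \mathcal{W})$.

That is, the value for any class $\mathcal{F}$ with subgradients in $\mathcal{X}$, which includes all the above classes, is upper bounded by the value of the class of linear functionals in $\mathcal{X}$, see e.g. [1]. In particular, this includes the class $\mathcal{F}_{\mathrm{Lip}}$, which is the class of all functions with subgradients in $\mathcal{X}$, and thus, since $\mathcal{F}_{\mathrm{lin}}(\mathcal{X}) \subset \mathcal{F}_{\mathrm{Lip}}(\mathcal{X})$, we get the first equality. The second equality is shown in [18].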

^1 Throughout we commit to a slight abuse of notation, with $\nabla f(w)$ indicating some sub-gradient of $f$ at $w$ and $\nabla f(w) \in \mathcal{X}$ meaning that at least one of the sub-gradients is in $\mathcal{X}$.



The class $\mathcal{F}_{\mathrm{sup}}(\mathcal{X})$ corresponds to linear prediction with an absolute-difference loss, and thus its value is the best possible guarantee for online supervised learning with this loss. We can define more generally a class $\mathcal{F}_\ell = \{ w \mapsto \ell(\langle x, w \rangle, y) : x \in \mathcal{X}, y \in [-b, b] \}$ for any 1-Lipschitz loss $\ell$, and this class would also be of the desired type, with its value upper bounded by $\mathcal{V}_n(\mathcal{F}_{\mathrm{lin}}, \mathcal{X}, \mathcal{W})$. In fact, this setting includes supervised learning fairly generally, including problems such as multitask learning and matrix completion, where in all cases $\mathcal{X}$ specifies the data domain^2. The equality in the above proposition can also be essentially extended, with some extra constant factors, to most other commonly occurring convex loss function classes, such as the hinge loss class.

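Owing to Proposition 1, we can focus our attention on the class $\mathcal{F}_{\mathrm{lin}}$ (as the other two will behave similarly), and so use the shorthand

$$\mathcal{V}_n(\mathcal{W}, \mathcal{X}) := \mathcal{V}_n(\mathcal{F}_{\mathrm{lin}}, \mathcal{X}, \mathcal{W}) \qquad (2)$$

and henceforth the term value without any qualification refers to the value of the linear game. Further, for any $p \in [1, 2]$ we also define:

$$\mathcal{V}_p := \inf \left\{ V \ \middle| \ \forall n \in \mathbb{N}, \ \mathcal{V}_n(\mathcal{W}, \mathcal{X}) \le V n^{-(1 - \frac{1}{p})} \right\} \qquad (3)$$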

Most prior work on online learning and optimization considers the case when $\mathcal{W}$ is the unit ball of some Banach space, and $\mathcal{X}$ is the unit ball of the dual space, i.e. $\mathcal{W}$ and $\mathcal{X}$ are related to each other through duality. In this work, however, we analyze the general problem where $\mathcal{X} \subset \mathcal{B}^*$ is not necessarily the dual ball of $\mathcal{W}$. It will be convenient for us, however, to relate the notions of a convex set and a corresponding norm. The Minkowski functional of a subset $\mathcal{K}$ of a vector space $V$ is defined as $\|v\|_{\mathcal{K}} := \inf \{ \alpha > 0 : v \in \alpha \mathcal{K} \}$. If $\mathcal{K}$ is convex and centrally symmetric (i.e. $\mathcal{K} = -\mathcal{K}$), then $\|\cdot\|_{\mathcal{K}}$ is a semi-norm. Throughout this paper, we will require that $\mathcal{W}$ and $\mathcal{X}$ are convex and centrally symmetric. Further, if the set $\mathcal{K}$ is bounded then $\|\cdot\|_{\mathcal{K}}$ is a norm. Although not strictly required for our results, for simplicity we will assume $\mathcal{W}$ and $\mathcal{X}$ are such that $\|\cdot\|_{\mathcal{W}}$ and $\|\cdot\|_{\mathcal{X}}$ (the Minkowski functionals of the sets $\mathcal{W}$ and $\mathcal{X}$) are norms. Even though we do this for simplicity, we remark that all the results go through for semi-norms. We use $\mathcal{X}^*$ and $\mathcal{W}^*$ to represent the duals of the balls $\mathcal{X}$ and $\mathcal{W}$ respectively, i.e. the unit balls of the dual norms $\|\cdot\|^*_{\mathcal{X}}$ and $\|\cdot\|^*_{\mathcal{W}}$.
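To make the Minkowski functional concrete, here is a minimal sketch that approximates $\|v\|_{\mathcal{K}} = \inf\{\alpha > 0 : v \in \alpha\mathcal{K}\}$ by bisection, given only a membership oracle for $\mathcal{K}$. The oracle interface, the function name and the tolerance are assumptions made for illustration; the sketch also assumes $\mathcal{K}$ is convex, centrally symmetric and absorbs $v$ (so that some finite scaling of $\mathcal{K}$ contains $v$).

```python
def minkowski_functional(v, contains, tol=1e-9, max_doublings=200):
    """Approximate inf{a > 0 : v in a*K} for a convex, centrally symmetric set K.

    contains(u) -> bool is a membership oracle for K (hypothetical interface).
    """
    if all(x == 0 for x in v):
        return 0.0

    def in_scaled(a):
        # v in a*K  <=>  v/a in K
        return contains([x / a for x in v])

    # Grow the upper end until v lies in hi*K.
    hi = 1.0
    for _ in range(max_doublings):
        if in_scaled(hi):
            break
        hi *= 2.0
    else:
        raise ValueError("v does not appear to be absorbed by K")

    # Membership in a*K is monotone in a because K is convex and contains 0,
    # so bisection on the scaling factor is valid.
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if in_scaled(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For example, with the $\ell_1$ unit ball as $\mathcal{K}$ the functional recovers the $\ell_1$ norm, and with the $\ell_\infty$ unit ball it recovers the $\ell_\infty$ norm.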

3 Mirror Descent and Uniform Convexity 

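A key tool in the analysis of mirror descent is the notion of strong convexity, or more generally uniform convexity:

Definition 1. $\Psi : \mathcal{B} \mapsto \mathbb{R}$ is $q$-uniformly convex w.r.t. $\|\cdot\|$ in $\mathcal{W} \subset \mathcal{B}$ if:

$$\forall w, w' \in \mathcal{W}, \ \forall \alpha \in [0, 1]: \quad \Psi(\alpha w + (1 - \alpha) w') \le \alpha \Psi(w) + (1 - \alpha) \Psi(w') - \frac{\alpha (1 - \alpha)}{q} \|w - w'\|^q .$$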
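The inequality in Definition 1 can be checked numerically on sample points. The sketch below evaluates the gap between its two sides, assuming the standard form with deficit term $\frac{\alpha(1-\alpha)}{q}\|w - w'\|^q$; the helper names are illustrative. For $\Psi(w) = \frac{1}{2}\|w\|_2^2$ with $q = 2$ and the Euclidean norm the inequality holds with equality, whereas it fails if the norm is inflated by a factor of two.

```python
import math


def l2_norm(u):
    return math.sqrt(sum(x * x for x in u))


def half_squared_l2(u):
    return 0.5 * sum(x * x for x in u)


def uniform_convexity_gap(psi, norm, w, w2, alpha, q):
    """Right-hand side minus left-hand side of the q-uniform-convexity inequality.

    A non-negative value means the inequality holds at (w, w2, alpha).
    """
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(w, w2)]
    diff = [a - b for a, b in zip(w, w2)]
    rhs = (alpha * psi(w) + (1 - alpha) * psi(w2)
           - alpha * (1 - alpha) / q * norm(diff) ** q)
    return rhs - psi(mix)
```

A negative gap at any sampled triple certifies that the candidate function is not $q$-uniformly convex with respect to the chosen norm; non-negative gaps on samples are only evidence, not a proof.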



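It is important to emphasize that in the definition above, the norm $\|\cdot\|$ and the subset $\mathcal{W}$ need not be related, and we only require uniform convexity inside $\mathcal{W}$. This allows us to relate a norm with a non-matching "ball". To this end define

$$D_p := \inf \left\{ \left( \sup_{w \in \mathcal{W}} \Psi(w) \right)^{\frac{p-1}{p}} \ \middle| \ \Psi : \mathcal{W} \mapsto \mathbb{R}^+ \text{ is } \tfrac{p}{p-1}\text{-uniformly convex w.r.t. } \|\cdot\|_{\mathcal{X}^*} \text{ on } \mathcal{W}, \ \Psi(0) = 0 \right\} .$$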



^2 Note that any convex supervised learning problem can necessarily be viewed as linear classification with some convex constraint $\mathcal{W}$ on the predictors.

Given a function $\Psi$, the Mirror Descent algorithm $\mathcal{A}_{\mathrm{MD}}$ is given by

$$w_{t+1} = \arg\min_{w \in \mathcal{W}} \ \Delta_\Psi(w | w_t) + \eta \langle \nabla f_t(w_t), w - w_t \rangle \qquad (4)$$

or equivalently

$$w'_{t+1} = \nabla \Psi^* \left( \nabla \Psi(w_t) - \eta \nabla f_t(w_t) \right), \qquad w_{t+1} = \arg\min_{w \in \mathcal{W}} \ \Delta_\Psi(w | w'_{t+1}) \qquad (5)$$

where $\Delta_\Psi(w | w') := \Psi(w) - \Psi(w') - \langle \nabla \Psi(w'), w - w' \rangle$ is the Bregman divergence and $\Psi^*$ is the convex conjugate of $\Psi$. As an example, notice that when $\Psi(w) = \frac{1}{2} \|w\|_2^2$ we get back the gradient descent algorithm, and when $\mathcal{W}$ is the $d$-dimensional simplex and $\Psi$ is the negative entropy $\Psi(w) = \sum_{i=1}^{d} w_i \log w_i$ we get the multiplicative weights update algorithm.
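The multiplicative weights instantiation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the negative-entropy distance generating function on the probability simplex and linear costs whose gradients are supplied up front (an oblivious adversary); the function names are hypothetical. For this choice the mirror step followed by the Bregman projection reduces to multiplying each weight by $\exp(-\eta g_{t,i})$ and renormalising.

```python
import math


def entropic_mirror_descent(grads, eta):
    """Online Mirror Descent on the simplex with negative entropy as Psi.

    grads: list of gradient vectors g_t (linear costs, so g_t does not depend on w_t).
    Returns the list of iterates w_1, ..., w_n.
    """
    d = len(grads[0])
    w = [1.0 / d] * d  # w_1 minimises negative entropy: the uniform distribution
    iterates = []
    for g in grads:
        iterates.append(w)
        # Mirror step in the dual followed by the KL projection onto the simplex,
        # which for this Psi is a multiplicative update plus renormalisation.
        unnorm = [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]
        z = sum(unnorm)
        w = [u / z for u in unnorm]
    return iterates


def average_regret_simplex(grads, iterates):
    """Average regret of the iterates against the best fixed point of the simplex."""
    n = len(grads)
    d = len(grads[0])
    learner = sum(sum(gi * wi for gi, wi in zip(g, w))
                  for g, w in zip(grads, iterates)) / n
    # A linear function on the simplex is minimised at a vertex.
    best = min(sum(g[i] for g in grads) for i in range(d)) / n
    return learner - best
```

With gradients bounded by 1 in absolute value and $\eta = \sqrt{\log d / n}$, the classical analysis of this update gives average regret of order $\sqrt{\log d / n}$, in line with the general guarantee discussed next.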

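Lemma 2. Let $\Psi : \mathcal{B} \mapsto \mathbb{R}$ be non-negative and $q$-uniformly convex w.r.t. norm $\|\cdot\|_{\mathcal{X}^*}$ on $\mathcal{W}$. For the Mirror Descent algorithm with this $\Psi$, using $w_1 = \arg\min_{w \in \mathcal{W}} \Psi(w)$ and $\eta = \left( \frac{\sup_{w \in \mathcal{W}} \Psi(w)}{n} \right)^{1/p}$, we can guarantee that for any $f_1, \ldots, f_n$ s.t. $\frac{1}{n} \sum_{t=1}^{n} \|\nabla f_t\|^p_{\mathcal{X}} \le 1$ (where $p = \frac{q}{q-1}$),

$$R_n(\mathcal{A}_{\mathrm{MD}}, f_1, \ldots, f_n) \le 2 \left( \frac{\sup_{w \in \mathcal{W}} \Psi(w)}{n} \right)^{1/q} .$$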

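Note that in our case we have $\nabla f \in \mathcal{X}$, i.e. $\|\nabla f\|_{\mathcal{X}} \le 1$, and so certainly $\frac{1}{n} \sum_{t=1}^{n} \|\nabla f_t\|^p_{\mathcal{X}} \le 1$. Similarly to the value of the game, for any $p \in [1, 2]$, we define:

$$\mathrm{MD}_p := \inf \left\{ D \ \middle| \ \exists \Psi, \eta \ \text{s.t.} \ \forall n \in \mathbb{N}, \ \sup_{f_{1:n} \in \mathcal{F}(\mathcal{X})} R_n(\mathcal{A}_{\mathrm{MD}}, f_{1:n}) \le D n^{-(1 - \frac{1}{p})} \right\} \qquad (6)$$

where the Mirror Descent algorithm in the above definition is run with the corresponding $\Psi$ and $\eta$. The constant $\mathrm{MD}_p$ is a characterization of the best guarantee the Mirror Descent algorithm can provide. Lemma 2 therefore implies:

Corollary 3. $\mathcal{V}_p \le \mathrm{MD}_p \le 2 D_p$.

Proof. The first inequality is essentially by the definition of $\mathcal{V}_p$ and $\mathrm{MD}_p$. The second inequality follows directly from the previous lemma. $\square$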
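The step size and regret bound of Lemma 2 are simple functions of $\sup_{w \in \mathcal{W}} \Psi(w)$, $n$ and $q$. The sketch below evaluates them, assuming the lemma is read as $\eta = (\sup \Psi / n)^{1/p}$ and $R_n \le 2 (\sup \Psi / n)^{1/q}$ with $p = q/(q-1)$ and gradients normalised so that $\frac{1}{n}\sum_t \|\nabla f_t\|^p_{\mathcal{X}} \le 1$. That reading is our reconstruction of the statement, and the function name is illustrative.

```python
def lemma2_step_size_and_bound(sup_psi, n, q):
    """Step size and average-regret bound for Mirror Descent with a
    q-uniformly convex Psi, under the reading of Lemma 2 stated above.

    sup_psi: sup of Psi over the constraint set W (must be > 0).
    n: number of rounds. q: uniform-convexity exponent (q >= 2).
    """
    p = q / (q - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    eta = (sup_psi / n) ** (1.0 / p)
    bound = 2.0 * (sup_psi / n) ** (1.0 / q)
    return eta, bound
```

For $q = 2$ this gives the familiar $\eta \propto 1/\sqrt{n}$ step size and an $O(\sqrt{\sup \Psi / n})$ regret bound; larger $q$ yields the slower rate $n^{-1/q}$.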

The Mirror Descent bound suggests that as long as we can find an appropriate function $\Psi$ that is uniformly convex w.r.t. $\|\cdot\|_{\mathcal{X}^*}$ we can get a diminishing regret guarantee using Mirror Descent. This suggests constructing the following function:

$$\Psi_q := \arg\min_{\psi \, : \, \psi \text{ is } q\text{-uniformly convex w.r.t. } \|\cdot\|_{\mathcal{X}^*} \text{ on } \mathcal{W} \text{ and } \psi \ge 0} \ \sup_{w \in \mathcal{W}} \psi(w) . \qquad (7)$$

If no $q$-uniformly convex function exists then $\Psi_q = \infty$ is assumed by default. The above function is in a sense the best choice for the Mirror Descent bound of Lemma 2. The question then is: when can we find such appropriate functions and what is the best rate we can guarantee using Mirror Descent?

4 Martingale Type and Value 

In [24], it was shown that the concept of the Martingale type (also sometimes called the Haar type) of a Banach space and optimal rates for online convex optimization problems, where $\mathcal{X}$ and $\mathcal{W}$ are duals of each other, are closely related. In this section we extend the classic notion of Martingale type of a Banach space (see for instance [16]) to one that accounts for the pair $(\mathcal{W}^*, \mathcal{X})$. Before we proceed with the definitions we would like to introduce some necessary notation. First, throughout we shall use $\epsilon \in \{\pm 1\}^{\mathbb{N}}$ to represent an infinite sequence of signs drawn uniformly at random (i.e. each $\epsilon_i$ has equal probability of being $+1$ or $-1$). Also throughout, $(x_n)_{n \in \mathbb{N}}$ represents a sequence of mappings where each $x_n : \{\pm 1\}^{n-1} \mapsto \mathcal{B}^*$. We shall commit to the abuse of notation and use $x_n(\epsilon)$ to represent $x_n(\epsilon) = x_n(\epsilon_1, \ldots, \epsilon_{n-1})$ (i.e. although we used the entire $\epsilon$ as argument, $x_n$ only depends on the first $n - 1$ signs). We are now ready to give the extended definition of Martingale type (or M-type) of a pair $(\mathcal{W}^*, \mathcal{X})$.



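Definition 2. A pair $(\mathcal{W}^*, \mathcal{X})$ of subsets of a vector space $\mathcal{B}^*$ is said to be of M-type $p$ if there exists a constant $C \ge 1$ such that for all sequences of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \mapsto \mathcal{B}^*$ and any $x_0 \in \mathcal{B}^*$:

$$\sup_n \mathbb{E} \left[ \left\| x_0 + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|^p_{\mathcal{W}^*} \right] \le C^p \left( \|x_0\|^p_{\mathcal{X}} + \sum_{n \ge 1} \mathbb{E} \left[ \|x_n(\epsilon)\|^p_{\mathcal{X}} \right] \right) \qquad (8)$$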



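The concept is called Martingale type because $(\epsilon_n x_n(\epsilon))_{n \in \mathbb{N}}$ is a martingale difference sequence and it can be shown that the rate of convergence of martingales in Banach spaces is governed by the rate of convergence of martingales of the form $Z_n = x_0 + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon)$ (which are incidentally called Walsh-Paley martingales). We point the reader to [16, 17] for more details. Further, for any $p \in [1, 2]$ we also define

$$C_p := \inf \left\{ C \ \middle| \ \forall x_0 \in \mathcal{B}^*, \forall (x_n)_{n \in \mathbb{N}}, \ \sup_n \mathbb{E} \left[ \left\| x_0 + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|^p_{\mathcal{W}^*} \right] \le C^p \left( \|x_0\|^p_{\mathcal{X}} + \sum_{n \ge 1} \mathbb{E} \left[ \|x_n(\epsilon)\|^p_{\mathcal{X}} \right] \right) \right\} .$$

$C_p$ is useful in determining if the pair $(\mathcal{W}^*, \mathcal{X})$ has Martingale type $p$.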

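The results of [24, 18] showing that a Martingale type implies low regret actually apply also for "non-matching" $\mathcal{W}$ and $\mathcal{X}$ and, in our notation, imply that $\mathcal{V}_p \le 2 C_p$. Specifically we have the following theorem from [24, 18]:

Theorem 4. [24, 18] For any $\mathcal{W} \subset \mathcal{B}$ and any $\mathcal{X} \subset \mathcal{B}^*$ and any $n \ge 1$,

$$\sup \mathbb{E} \left[ \left\| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^*} \right] \le \mathcal{V}_n(\mathcal{W}, \mathcal{X}) \le 2 \sup \mathbb{E} \left[ \left\| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^*} \right]$$

where the supremum above is over sequences of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \mapsto \mathcal{X}$.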
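For small $n$, the expectation appearing in Theorem 4 can be evaluated exactly by enumerating all sign sequences. The sketch below does this for a user-supplied sequence of mappings; any such sequence taking values in $\mathcal{X}$ yields a lower bound on the supremum and hence, by the left-hand inequality, on the value $\mathcal{V}_n(\mathcal{W}, \mathcal{X})$. The interface (a callable `x(i, prefix)` and a `dual_norm` callable for $\|\cdot\|_{\mathcal{W}^*}$) is an assumption made for illustration.

```python
import itertools


def signed_average_norm(n, x, dual_norm):
    """Exact E_eps || (1/n) sum_{i=1}^n eps_i x_i(eps_1..eps_{i-1}) ||.

    x(i, prefix) returns the vector x_i given the tuple of the first i-1 signs.
    dual_norm(v) evaluates the norm ||.||_{W*} of a vector v.
    The cost is 2^n evaluations, so this is only practical for small n.
    """
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        acc = None
        for i in range(1, n + 1):
            xi = x(i, eps[: i - 1])  # x_i may depend only on earlier signs
            if acc is None:
                acc = [0.0] * len(xi)
            acc = [a + eps[i - 1] * v for a, v in zip(acc, xi)]
        total += dual_norm([a / n for a in acc])
    return total / 2 ** n
```

In one dimension with $\mathcal{W} = \mathcal{X} = [-1, 1]$ and the constant sequence $x_i \equiv 1$, the quantity is $\mathbb{E}|\sum_i \epsilon_i| / n$, which for $n = 4$ equals $0.375$.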

Our main interest here is in establishing that low regret implies Martingale type. To do so, we start with the above theorem to relate the value of the online convex optimization game to the rate of convergence of martingales in the Banach space. We then extend the result of Pisier in [16] to the "non-matching" setting, combining it with the above theorem to finally get:

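Lemma 5. If for some $r \in (1, 2]$ there exists a constant $D > 0$ such that for any $n$,

$$\mathcal{V}_n(\mathcal{W}, \mathcal{X}) \le D n^{-(1 - \frac{1}{r})}$$

then for all $p < r$, we can conclude that any $x_0 \in \mathcal{B}^*$ and any sequence of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \mapsto \mathcal{B}^*$ will satisfy:

$$\sup_n \mathbb{E} \left[ \left\| x_0 + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|^p_{\mathcal{W}^*} \right] \le \left( \frac{1104 \, D}{(r - p)^2} \right)^p \left( \|x_0\|^p_{\mathcal{X}} + \sum_{i \ge 1} \mathbb{E} \left[ \|x_i(\epsilon)\|^p_{\mathcal{X}} \right] \right)$$

That is, the pair $(\mathcal{W}^*, \mathcal{X})$ is of martingale type $p$.

The following corollary is an easy consequence of the above lemma.

Corollary 6. For any $p \in [1, 2]$ and any $p' < p$: $C_{p'} \le \frac{1104 \, \mathcal{V}_p}{(p - p')^2}$.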



5 Uniform Convexity and Martingale Type 

The classical notion of Martingale type plays a central role in the study of the geometry of Banach spaces. In [16], it was shown that a Banach space has Martingale type $p$ (the classical notion) if and only if uniformly convex functions with certain properties exist on that space (w.r.t. the norm of that Banach space). In this section, we extend this result and show how the Martingale type of a pair $(\mathcal{W}^*, \mathcal{X})$ is related to the existence of certain uniformly convex functions. Specifically, the following theorem shows that the notion of Martingale type of the pair $(\mathcal{W}^*, \mathcal{X})$ is equivalent to the existence of a non-negative function that is uniformly convex w.r.t. the norm $\|\cdot\|_{\mathcal{X}^*}$ on $\mathcal{W}$.



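Lemma 7. If, for some $p \in (1, 2]$, there exists a constant $C > 0$, such that for all sequences of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \mapsto \mathcal{B}^*$ and any $x_0 \in \mathcal{B}^*$:

$$\sup_n \mathbb{E} \left[ \left\| x_0 + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|^p_{\mathcal{W}^*} \right] \le C^p \left( \|x_0\|^p_{\mathcal{X}} + \sum_{n \ge 1} \mathbb{E} \left[ \|x_n(\epsilon)\|^p_{\mathcal{X}} \right] \right)$$

(i.e. $(\mathcal{W}^*, \mathcal{X})$ has Martingale type $p$), then there exists a convex function $\Psi : \mathcal{B} \mapsto \mathbb{R}^+$ with $\Psi(0) = 0$, that is $q$-uniformly convex w.r.t. norm $\|\cdot\|_{\mathcal{X}^*}$ s.t. $\forall w \in \mathcal{B}$, $\frac{1}{q} \|w\|^q_{\mathcal{X}^*} \le \Psi(w) \le \frac{C^q}{q} \|w\|^q_{\mathcal{W}}$.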

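The following corollary follows directly from the above lemma.

Corollary 8. For any $p \in [1, 2]$, $D_p \le C_p$.

The proof of Lemma 7 goes further and gives a specific uniformly convex function $\Psi$ satisfying the desired requirement (i.e. establishing $D_p \le C_p$) under the assumptions of the previous lemma:

$$\Psi_q^*(x) := \sup \left\{ \frac{1}{C^p} \sup_n \mathbb{E} \left[ \left\| x + \sum_{i=1}^{n} \epsilon_i x_i(\epsilon) \right\|^p_{\mathcal{W}^*} \right] - \sum_{i \ge 1} \mathbb{E} \left[ \|x_i(\epsilon)\|^p_{\mathcal{X}} \right] \right\}, \qquad \Psi_q := (\Psi_q^*)^* \qquad (9)$$

where the supremum above is over sequences $(x_n)_{n \in \mathbb{N}}$ and $p = \frac{q}{q-1}$.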



6 Optimality of Mirror Descent 

In Section 3, we saw that if we can find an appropriate uniformly convex function to use in the mirror descent algorithm, we can guarantee diminishing regret. However, the pending question there was when we can find such a function and what rate we can guarantee. In Section 4 we introduced the extended notion of Martingale type of a pair $(\mathcal{W}^*, \mathcal{X})$ and how it relates to the value of the game. Then, in Section 5, we saw how the concept of M-type relates to the existence of certain uniformly convex functions. We can now combine these results to show that the mirror descent algorithm is a universal online learning algorithm for convex learning problems. Specifically we show that whenever a problem is online learnable, the mirror descent algorithm can guarantee near optimal rates:

Theorem 9. If for some constant $V > 0$ and some $q \in [2, \infty)$, $\mathcal{V}_n(\mathcal{W}, \mathcal{X}) \le V n^{-\frac{1}{q}}$ for all $n$, then for any $n > e^{q-1}$, there exists a regularizer function $\Psi$ and step-size $\eta$, such that the regret of the mirror descent algorithm using $\Psi$ against any $f_1, \ldots, f_n$ chosen by the adversary is bounded as:

$$R_n(\mathcal{A}_{\mathrm{MD}}, f_{1:n}) \le 6002 \, V \log^2(n) \, n^{-\frac{1}{q}} \qquad (10)$$

Proof. Combining the Mirror Descent guarantee in Lemma 2, Lemma 7 and the lower bound in Lemma 5 with $p = \frac{q}{q-1} - \frac{1}{\log(n)}$ we get the above statement. $\square$

The above theorem tells us that, with an appropriate $\Psi$ and learning rate $\eta$, mirror descent will obtain regret at most a factor of $6002 \log^2(n)$ from the best possible worst-case upper bound. We would like to point out that the constant $V$ in the value of the game appears linearly and there are no other problem- or space-related hidden constants in the bound.
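The numeric constant in Theorem 9 can be traced through the chain of results. The sketch below is a sanity check under stated assumptions rather than part of the proof: it assumes the Lemma 5 constant has the form $1104 D / (r - p)^2$, uses $\mathrm{MD}_p \le 2 D_p \le 2 C_p$ (Corollaries 3 and 8), and takes $p = r - 1/\log n$ with $r = q/(q-1)$. The function name is illustrative. Under these assumptions the factor multiplying $V \log^2(n) n^{-1/q}$ is $2 \cdot 1104 \cdot e^{1/(rp)} \le 2208 e \approx 6002$.

```python
import math


def theorem9_constant(q, n):
    """Return (p, c), where the chain of bounds gives regret <= c * V * log(n)^2 * n^(-1/q).

    Assumes n > e^(q-1), so that p = q/(q-1) - 1/log(n) exceeds 1.
    """
    r = q / (q - 1.0)                # the value decays as n^{-(1 - 1/r)} = n^{-1/q}
    p = r - 1.0 / math.log(n)        # choice of p made in the proof of Theorem 9
    c_p = 1104.0 / (r - p) ** 2      # assumed Lemma 5 factor, equal to 1104 * log(n)^2
    md_factor = 2.0 * c_p            # MD_p <= 2 D_p <= 2 C_p
    # Mirror Descent run with exponent p achieves rate n^{-(1 - 1/p)};
    # relative to n^{-1/q} this costs a factor n^{1/p - 1/r}, which is at most e.
    slack = n ** (1.0 / p - 1.0 / r)
    return p, md_factor * slack / math.log(n) ** 2
```

For example, $q = 2$ and $n = 1000$ give $p \approx 1.855$ and a constant well below 6002; the worst case $2208e$ is approached only as $rp \to 1$.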

The following figure summarizes the relationship between the various constants. The arrow mark from $C_{p'}$ to $C_p$ indicates that for any $n$, all the quantities are within a $\log^2 n$ factor of each other.

We now provide some general guidelines that will help us in picking out an appropriate function $\Psi$ for mirror descent. First we note that though the function $\Psi_q$ in the construction (9) need not be such that $(\Psi_q(w))^{1/q}$ is a norm, with a simple modification as noted in [17] we can make it a norm. This basically tells us that the pair $(\mathcal{W}, \mathcal{X})$ is online learnable if and only if we can sandwich a $q$-uniformly convex norm in-between $\mathcal{X}^*$ and a scaled version of $\mathcal{W}$ (for some $q < \infty$). Also note that by definition of uniform convexity, if any function $\Psi$ is $q$-uniformly convex w.r.t. some norm $\|\cdot\|$ and we have that $\|\cdot\| \ge c \|\cdot\|_{\mathcal{X}^*}$, then $\Psi(\cdot) / c^q$ is $q$-uniformly convex w.r.t. norm $\|\cdot\|_{\mathcal{X}^*}$. These two observations together suggest that given a pair $(\mathcal{W}, \mathcal{X})$ what we need to do is find a norm $\|\cdot\|$ in between $\|\cdot\|_{\mathcal{X}^*}$ and $C \|\cdot\|_{\mathcal{W}}$ ($C < \infty$; the smaller the $C$, the better the bound) such that $\|\cdot\|^q$ is $q$-uniformly convex w.r.t. $\|\cdot\|$.

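[Figure 1: diagram relating the constants $C_{p'}$ (for $p' < p$), $\mathcal{V}_p$, $\mathrm{MD}_p$, $D_p$ and $C_p$, with arrows labeled "Lemma 5 (extending Pisier's result [16])", "Lemma 2 (Generalized MD guarantee)" and "Construction of $\Psi$, Lemma 11 (extending Pisier's result [16])".]

Figure 1: Relationship between the various constants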

7 Examples 

We demonstrate our results on several online learning problems, specified by W and X. 

$\ell_p$ non-dual pairs. It is usual in the literature to consider the case when $\mathcal{W}$ is the unit ball of the $\ell_p$ norm in some finite dimension $d$ while $\mathcal{X}$ is taken to be the unit ball of the dual norm $\ell_q$, where $p, q$ are Hölder conjugate exponents. Using the machinery developed in this paper, it becomes effortless to consider the non-dual case when $\mathcal{W}$ is the unit ball $B_{p_1}$ of some $\ell_{p_1}$ norm while $\mathcal{X}$ is the unit ball $B_{p_2}$ for arbitrary $p_1, p_2$ in $[1, \infty]$. We shall use $q_1$ and $q_2$ to represent the Hölder conjugates of $p_1$ and $p_2$. Before we proceed we first note that for any $r \in (1, 2]$, $\psi_r(w) := \frac{1}{2(r-1)} \|w\|_r^2$ is 2-uniformly convex w.r.t. norm $\|\cdot\|_r$ (see for instance [25]). On the other hand, by Clarkson's inequality, we have that for $r \in (2, \infty)$, $\psi_r(w) := \frac{2^r}{r} \|w\|_r^r$ is $r$-uniformly convex w.r.t. $\|\cdot\|_r$. Putting it together we see that for any $r \in (1, \infty)$, the function $\psi_r$ defined above is $Q$-uniformly convex w.r.t. $\|\cdot\|_r$ for $Q = \max\{r, 2\}$. The basic technique is to select $\psi_r$ based on the guidelines at the end of the previous section. Finally we show that using $\tilde{\psi}_r := d^{\, Q \max\{\frac{1}{q_2} - \frac{1}{r}, 0\}} \, \psi_r$ in the Mirror Descent Lemma 2 yields the bound that for any $f_1, \ldots, f_n \in \mathcal{F}$:

$$R_n(\mathcal{A}_{\mathrm{MD}}, f_{1:n}) \le \frac{2 \max\left\{2, \frac{1}{\sqrt{2(r-1)}}\right\} \, d^{\, \max\{\frac{1}{q_2} - \frac{1}{r}, 0\} + \max\{\frac{1}{r} - \frac{1}{p_1}, 0\}}}{n^{1/\max\{r, 2\}}}$$

The following table summarizes the scenarios where a value of $r = 2$, i.e. a rate of $D_2 / \sqrt{n}$, is possible, and lists the corresponding values of $D_2$ (up to a numeric constant of at most 16):

$p_1$ range          $q_2 = \frac{p_2}{p_2 - 1}$ range     $D_2$
$1 < p_1 < 2$        $q_2 > 2$                             $1$
$1 < p_1 < 2$        $p_1 < q_2 < 2$                       $\sqrt{p_2 - 1}$
$1 < p_1 < 2$        $1 < q_2 < p_1$                       $d^{1/q_2 - 1/p_1} \sqrt{p_2 - 1}$
$p_1 > 2$            $q_2 > 2$                             $d^{(1/2 - 1/p_1)}$
$p_1 > 2$            $1 < q_2 < 2$                         $d^{(1/q_2 - 1/p_1)}$
$1 < p_1 < 2$        $q_2 = \infty$                        $\sqrt{\log(d)}$



Note that the first two rows are dimension free, and so apply also in infinite-dimensional settings, whereas in the other scenarios, $D_2$ is finite only when the dimension is finite. An interesting phenomenon occurs when $d$ is $\infty$, $p_1 > 2$ and $q_2 > p_1$. In this case $D_2 = \infty$ and so one cannot expect a rate of $O(\frac{1}{\sqrt{n}})$. However we have $D_{p_2} < 16$ and so can still get a rate of $n^{-\frac{1}{q_2}}$.

Ball et al. [3] tightly calculate the constants of strong convexity of squared $\ell_p$ norms, establishing the tightness of $D_2$ when $p_1 = p_2$. By extending their constructions it is also possible to show tightness (up to a factor of 16) for all other values in the table. Also, Agarwal et al. [2] recently showed lower bounds on the sample complexity of stochastic optimization when $p_1 = \infty$ and $p_2$ is arbitrary; their lower bounds match the last two rows in the table.



Non-dual Schatten norm pairs in finite dimensions. Exactly the same analysis as above can be carried out for Schatten $p$-norms, i.e. when $\mathcal{W} = B_{S(p_1)}$ and $\mathcal{X} = B_{S(p_2)}$ are the unit balls of the Schatten $p$-norm (the $p$-norm of the singular values) for matrices of dimensions $d_1 \times d_2$. We get the same results as in the table above (as upper bounds on $D_2$), with $d = \min\{d_1, d_2\}$. These results again follow using similar arguments as in the $\ell_p$ case and tight constants for the strong convexity parameters of the Schatten norm from [3].



Non-dual group norm pairs in finite dimensions. In applications such as multitask learning, group norms such as $\|w\|_{q,1}$ are often used on matrices $w \in \mathbb{R}^{k \times d}$, where the $(q, 1)$ norm means taking the $\ell_1$-norm of the $\ell_q$-norms of the columns of $w$. Popular choices include $q = 2, \infty$. Here, it may be quite unnatural to use the dual norm $(p, \infty)$ to define the space $\mathcal{X}$ where the data lives. For instance, we might want to consider $\mathcal{W} = B_{(q,1)}$ and $\mathcal{X} = B_{(\infty,\infty)} = B_\infty$. In such a case we can calculate that $D_2(\mathcal{W}, \mathcal{X}) = O(k^{1 - \frac{1}{q}} \sqrt{\log(d)})$ using $\Psi(w) = \frac{1}{q + r - 2} \|w\|^2_{q,r}$ where $r = \frac{\log d}{\log d - 1}$.

Max Norm. Max-norm has been proposed as a convex matrix regularizer for applications such as matrix completion [21]. In the online version of the matrix completion problem, at each time step one element of the matrix is revealed, corresponding to $\mathcal{X}$ being the set of all matrices with a single element being 1 and the rest 0. Since we need $\mathcal{X}$ to be convex we can take the absolute convex hull of this set and use $\mathcal{X}$ to be the unit element-wise $\ell_1$ ball. Its dual is $\|W\|_{\mathcal{X}^*} = \max_{i,j} |W_{i,j}|$. On the other hand, given a matrix $W$, its max-norm is given by $\|W\|_{\max} = \min_{U, V : W = U V^\top} \left( \max_i \|U_i\|_2 \right) \left( \max_j \|V_j\|_2 \right)$. The set $\mathcal{W}$ is the unit ball under the max norm. As noted in [22], the max-norm ball is equivalent, up to a factor of two, to the convex hull of all rank one sign matrices. Let us now make a more general observation.

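Proposition 10. Let $\mathcal{W} = \mathrm{abscvx}(\{w_1, \ldots, w_K\})$. The Minkowski norm for this $\mathcal{W}$ is given by

$$\|w\|_{\mathcal{W}} := \inf_{\alpha_1, \ldots, \alpha_K : \, w = \sum_{i=1}^{K} \alpha_i w_i} \ \sum_{i=1}^{K} |\alpha_i| .$$

In this case, for any $q \in (1, 2]$, if we define the norm:

$$\|w\|_{\mathcal{W}, q} := \inf_{\alpha_1, \ldots, \alpha_K : \, w = \sum_{i=1}^{K} \alpha_i w_i} \ \left( \sum_{i=1}^{K} |\alpha_i|^q \right)^{1/q}$$

then the function $\Psi(w) = \frac{1}{2(q-1)} \|w\|^2_{\mathcal{W}, q}$ is 2-uniformly convex w.r.t. $\|\cdot\|_{\mathcal{W}, q}$. Further, if we use $q = \frac{\log K}{\log K - 1}$, then $\sup_{w \in \mathcal{W}} \Psi(w) = O(\log K)$.

The proof of the above proposition is similar to the proof of strong convexity of $\ell_q$ norms. For the max norm case, as noted before, the norm is equivalent to the norm obtained by taking the absolute convex hull of the set of all rank one sign matrices. For $M \times N$ matrices, the cardinality of this set is of course $2^{N+M}$. Hence, using the above proposition and noting that $\mathcal{X}^*$ is the unit ball of $|\cdot|_\infty$, we see that $\Psi$ is obviously 2-uniformly convex w.r.t. $\|\cdot\|_{\mathcal{X}^*}$ and so we get a regret bound $O\left( \sqrt{\frac{M + N}{n}} \right)$. This matches the stochastic (PAC) learning guarantee [22], and is the first guarantee we are aware of for the max norm matrix completion problem in the online setting.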

Interpolation Norms. Another interesting setting is when the set $\mathcal{W}$ is obtained by interpolating between unit balls of two other norms $\|\cdot\|_{\mathcal{W}_1}$ and $\|\cdot\|_{\mathcal{W}_2}$. Specifically, one can consider $\mathcal{W}$ to be the unit ball of one of two such interpolated norms. The first type of interpolation norm is given by

$$\|w\|_{\mathcal{W}} = \|w\|_{\mathcal{W}_1} + \|w\|_{\mathcal{W}_2} \qquad (11)$$

The second type of interpolation norm one can consider is given by

$$\|w\|_{\mathcal{W}} = \inf_{w_1 + w_2 = w} \left( \|w_1\|_{\mathcal{W}_1} + \|w_2\|_{\mathcal{W}_2} \right) \qquad (12)$$

In learning problems such interpolation norms are often used to induce certain structures or properties into the regularization. For instance, one might want sparsity along with a grouping effect in the linear predictors, for which the elastic-net type regularization introduced by Zou and Hastie [28] can be used (this is captured by interpolation of the first type between $\ell_1$ and $\ell_2$ norms). Another example is in matrix completion problems when we would like the predictor matrix to be decomposable into a sum of sparse and low rank matrices, as done by Chandrasekaran et al. [6] (here one can use the interpolation norm of the second type to interpolate between the trace norm and the element-wise $\ell_1$ norm). Another example where interpolation norms of type two are useful is in multi-task learning problems (with linear predictors), as done by Jalali et al. [8]. The basic idea is that the matrix of linear predictors can be decomposed into a sum of two matrices, one with, for instance, low entry-wise $\ell_1$ norm and the other with low $\ell_{(2,\infty)}$ group norm (group sparsity).

While in these applications the set $\mathcal{W}$ used is obtained through interpolation norms, it is typically not natural for the set $\mathcal{X}$ to be the dual ball of $\mathcal{W}$ but rather something more suited to the problem at hand. For instance, for the elastic net regularization case, the set $\mathcal{X}$ usually considered is either the vectors with bounded $\ell_\infty$ norm or those with bounded $\ell_2$ norm. Similarly, for the [8] case $\mathcal{X}$ could be either matrices with bounded entries or some other natural assumption that suits the problem.

It can be shown that in general for any interpolation norm of the first type specified in Equation (11),

$$D_2(\mathcal{W}, \mathcal{X}) \le 2 \min\{ D_2(\mathcal{W}_1, \mathcal{X}), D_2(\mathcal{W}_2, \mathcal{X}) \} \qquad (13)$$

Similarly, for the interpolation norm of type two one can in general show that

$$D_2(\mathcal{W}, \mathcal{X}) \le 2 \max\{ D_2(\mathcal{W}_1, \mathcal{X}), D_2(\mathcal{W}_2, \mathcal{X}) \} \qquad (14)$$

Using the above bounds one can get regret bounds for the mirror descent algorithm with appropriate $\Psi$ and step size $\eta$ for specific examples like the ones mentioned.

The bounds given in Equations (13) and (14) are only upper bounds and it would be interesting to analyze these cases in more detail and also to analyze interpolation between several norms instead of just two.

8 Conclusion and Discussion 

In this paper we showed that for a general class of convex online learning problems, there always exists a distance generating function $\Psi$ such that Mirror Descent using this function achieves a near-optimal regret guarantee. This shows that a fairly simple first-order method, in which each iteration requires a gradient computation and a prox-map computation, is sufficient for online learning in a very general sense. Of course, the main challenge is deriving distance generating functions appropriate for specific problems. Although we give two mathematical expressions for such functions, in equations (7) and (9), neither is particularly tractable in general. At the end of Section 6 we do give some general guidelines for choosing the right distance generating function. However, obtaining a more explicit and simple procedure, at least for reasonable Banach spaces, is a very interesting question.

Furthermore, for the Mirror Descent procedure to be efficient, the prox-map of the distance generating function must be efficiently computable, which means that even though a Mirror Descent procedure is always theoretically possible, we might in practice choose to use a non-optimal distance generating function, or even a non-MD procedure. Furthermore, we might also find other properties of $w$ desirable, such as sparsity, which would bias us toward alternative methods [12, 7]. Nevertheless, in most instances that we are aware of, Mirror Descent, or slight variations of it, is truly an optimal procedure, and this is formalized and rigorously established here.

In terms of the generality of the problems we handle, we required that the constraint set $\mathcal{W}$ be convex, but this seems unavoidable if we wish to obtain efficient algorithms (at least in general). Furthermore, we know that in terms of worst-case behavior, both in the stochastic and in the online setting, for convex cost functions, the value is unchanged when a non-convex constraint set is replaced by its convex hull [18]. The requirement that the data domain $\mathcal{X}$ be convex is perhaps more restrictive, since even with a non-convex data domain, the objective is still convex. Such non-convex $\mathcal{X}$ are certainly relevant in many applications, e.g. when the data is sparse, or when $x \in \mathcal{X}$ is an indicator, as in matrix completion problems and total variation regularization. In the total variation regularization problem, $\mathcal{W}$ is the set of all functions on the interval $[0, 1]$ with total variation bounded by 1, which is in fact a Banach space. However, the set $\mathcal{X}$ we consider here is not the entire dual ball and in fact is neither convex nor symmetric. It only consists of evaluations of the functions in $\mathcal{W}$ at points on the interval $[0, 1]$, and one can consider a supervised learning problem where the goal is to use the set of all functions with bounded variation to predict targets which take on values in $[-1, 1]$. Although the total-variation problem is not learnable, the matrix completion problem certainly is of much interest. In the matrix completion case, taking the convex hull of $\mathcal{X}$ does not seem to change the value, but we are aware of neither a guarantee that the value of the game is unchanged when a non-convex $\mathcal{X}$ is replaced by its convex hull, nor of an example where the value does change; it would certainly be useful to understand this issue. We view the requirement that $\mathcal{W}$ and $\mathcal{X}$ be symmetric around the origin as less restrictive and mostly a matter of convenience.

We also focused on a specific form of the cost class F, which beyond the almost unavoidable assumption of
convexity, is taken to be constrained through the cost sub-gradients. This is general enough for considering supervised
learning with an arbitrary convex loss in a worst-case setting, as the sub-gradients in this case exactly correspond to
the data points, and so restricting F through its sub-gradients corresponds to restricting the data domain. Following
Proposition 1, any optimality result for F_lip also applies to F_sup, and this statement can also be easily extended to
any other reasonable loss function, including the hinge loss, smooth loss functions such as the logistic loss, and even
strongly convex loss functions such as the squared loss (in this context, note that a strongly convex scalar function for
supervised learning does not translate to a strongly convex optimization problem in the worst case). Going beyond a
worst-case formulation of supervised learning, one might consider online repeated games with other constraints on F,
such as strong convexity, or even constraints on {f_t} as a sequence, such as requiring low average error or conditions
on the covariance of the data; these are beyond the scope of the current paper.

Even for the statistical learning setting, online methods along with online-to-batch conversion are often preferred
due to their efficiency, especially in high dimensional problems. In fact, for ℓ_p spaces in the dual case, using lower
bounds on the sample complexity for statistical learning of these problems, one can show that for large dimensional
problems, Mirror Descent is an optimal procedure even for the statistical learning problem. We would like to consider
the question of whether Mirror Descent is optimal for the stochastic convex optimization (or equivalently, convex
statistical learning) setting [9, 19, 23] in general. Establishing such universality would have significant implications,
as it would indicate that any (convex) problem that is learnable is learnable using a one-pass first-order online method
(i.e. a Stochastic Approximation approach).
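
As a hedged illustration of the one-pass Stochastic Approximation idea, the following minimal sketch runs Euclidean Mirror Descent (projected online gradient descent) once over an i.i.d. sample and returns the averaged iterate, which is the standard online-to-batch conversion. The squared loss, the unit Euclidean ball as constraint set, the step size and all function and variable names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def one_pass_average(X, y, eta):
    """One pass of projected online gradient descent; returns the averaged iterate."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(n):
        w_avg += w / n                           # running average of w_1, ..., w_n
        grad = 2.0 * (X[t] @ w - y[t]) * X[t]    # gradient of (<w, x_t> - y_t)^2
        w = project_l2_ball(w - eta * grad)
    return w_avg

rng = np.random.default_rng(0)
n, d = 2000, 5
w_true = np.full(d, 0.5 / np.sqrt(d))                  # hypothetical target, norm 0.5
X = rng.uniform(-1.0, 1.0, size=(n, d)) / np.sqrt(d)   # rows satisfy ||x||_2 <= 1
y = X @ w_true                                         # noise-free targets (assumption)
w_hat = one_pass_average(X, y, eta=1.0 / np.sqrt(n))
mse_hat = float(np.mean((X @ w_hat - y) ** 2))
mse_zero = float(np.mean(y ** 2))
```

Each example is touched exactly once, and the averaged iterate stays inside the constraint set because the set is convex.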
Appendix

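Proof of Lemma 2 (generalized MD guarantee). Note that for any $w^\star \in \mathcal{W}$ (here $w'_{t+1}$ denotes the Mirror Descent iterate before projection onto $\mathcal{W}$, so that $\eta \nabla f_t(w_t) = \nabla\Psi(w_t) - \nabla\Psi(w'_{t+1})$),

$$
\begin{aligned}
\sum_{t=1}^n f_t(w_t) - \sum_{t=1}^n f_t(w^\star)
&\le \sum_{t=1}^n \langle \nabla f_t(w_t), w_t - w^\star \rangle \\
&= \sum_{t=1}^n \left( \langle \nabla f_t(w_t), w_t - w'_{t+1} \rangle + \langle \nabla f_t(w_t), w'_{t+1} - w^\star \rangle \right) \\
&\le \sum_{t=1}^n \left( \|\nabla f_t(w_t)\|_{\mathcal{X}} \, \|w_t - w'_{t+1}\|_{\mathcal{X}^\star} + \frac{1}{\eta} \langle \nabla\Psi(w_t) - \nabla\Psi(w'_{t+1}), w'_{t+1} - w^\star \rangle \right) \\
&\le \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q + \frac{1}{\eta} \langle \nabla\Psi(w_t) - \nabla\Psi(w'_{t+1}), w'_{t+1} - w^\star \rangle \right)
\end{aligned}
$$

Using simple manipulation we can show that

$$\langle \nabla\Psi(w_t) - \nabla\Psi(w'_{t+1}), w'_{t+1} - w^\star \rangle = \Delta_\Psi(w^\star | w_t) - \Delta_\Psi(w^\star | w'_{t+1}) - \Delta_\Psi(w'_{t+1} | w_t)$$

where given any $w, w' \in \mathcal{B}$,

$$\Delta_\Psi(w | w') := \Psi(w) - \Psi(w') - \langle \nabla\Psi(w'), w - w' \rangle$$

is the Bregman divergence between $w$ and $w'$ w.r.t. function $\Psi$. Hence,

$$
\begin{aligned}
\sum_{t=1}^n f_t(w_t) - \sum_{t=1}^n f_t(w^\star)
&\le \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q + \frac{1}{\eta} \langle \nabla\Psi(w_t) - \nabla\Psi(w'_{t+1}), w'_{t+1} - w^\star \rangle \right) \\
&= \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q + \frac{1}{\eta} \left( \Delta_\Psi(w^\star | w_t) - \Delta_\Psi(w^\star | w'_{t+1}) - \Delta_\Psi(w'_{t+1} | w_t) \right) \right) \\
&\le \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q + \frac{1}{\eta} \left( \Delta_\Psi(w^\star | w_t) - \Delta_\Psi(w^\star | w_{t+1}) - \Delta_\Psi(w'_{t+1} | w_t) \right) \right) \\
&= \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q - \frac{1}{\eta} \Delta_\Psi(w'_{t+1} | w_t) \right) + \frac{1}{\eta} \left( \Delta_\Psi(w^\star | w_1) - \Delta_\Psi(w^\star | w_{n+1}) \right) \\
&\le \sum_{t=1}^n \left( \frac{\eta^{p-1}}{p} \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta q} \|w_t - w'_{t+1}\|_{\mathcal{X}^\star}^q - \frac{1}{\eta} \Delta_\Psi(w'_{t+1} | w_t) \right) + \frac{1}{\eta} \Psi(w^\star)
\end{aligned}
$$

Now since $\Psi$ is $q$-uniformly convex w.r.t. $\|\cdot\|_{\mathcal{X}^\star}$, for any $w, w' \in \mathcal{B}$, $\Delta_\Psi(w' | w) \ge \frac{1}{q} \|w - w'\|_{\mathcal{X}^\star}^q$. Hence we conclude that

$$\sum_{t=1}^n f_t(w_t) - \sum_{t=1}^n f_t(w^\star) \le \frac{\eta^{p-1}}{p} \sum_{t=1}^n \|\nabla f_t(w_t)\|_{\mathcal{X}}^p + \frac{1}{\eta} \Psi(w^\star)$$

Plugging in the value of $\eta = \left( \frac{\sup_{w \in \mathcal{W}} \Psi(w)}{nB} \right)^{1/p}$ we get:

$$\sum_{t=1}^n f_t(w_t) - \sum_{t=1}^n f_t(w^\star) \le 2 \left( \sup_{w \in \mathcal{W}} \Psi(w) \right)^{1/q} (Bn)^{1/p}$$

Dividing throughout by n concludes the proof.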
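
To make the guarantee concrete, here is a minimal sketch of Online Mirror Descent for one familiar special case, chosen by us for illustration: the probability simplex with the shifted negative entropy Ψ(w) = Σ_i w_i log w_i + log d, which is non-negative and 2-uniformly convex w.r.t. the ℓ1 norm (so p = q = 2), together with linear losses whose gradients have ℓ∞ norm at most 1 (so B = 1). The simplex is not centrally symmetric, so this only illustrates the update and the step size η = (sup Ψ / (nB))^{1/p}; all function and variable names are hypothetical.

```python
import numpy as np

def entropic_mirror_descent(losses, eta):
    """Online MD on the simplex with Psi(w) = sum_i w_i log w_i + log d.

    Returns the cumulative loss sum_t <x_t, w_t> of the learner.
    """
    n, d = losses.shape
    w = np.full(d, 1.0 / d)        # w_1 = argmin of Psi over the simplex
    total = 0.0
    for x in losses:               # f_t(w) = <x_t, w>, so grad f_t(w_t) = x_t
        total += float(x @ w)
        w = w * np.exp(-eta * x)   # unprojected mirror step w'_{t+1}
        w /= w.sum()               # Bregman (KL) projection back onto the simplex
    return total

rng = np.random.default_rng(0)
n, d, B = 500, 10, 1.0
losses = rng.uniform(-1.0, 1.0, size=(n, d))   # each ||x_t||_inf^2 <= B
sup_psi = np.log(d)                            # sup of Psi over the simplex
eta = np.sqrt(sup_psi / (n * B))               # (sup Psi / (nB))^{1/p} with p = 2
regret = entropic_mirror_descent(losses, eta) - losses.sum(axis=0).min()
bound = 2.0 * np.sqrt(sup_psi) * np.sqrt(B * n)   # 2 (sup Psi)^{1/q} (Bn)^{1/p}
```

Because the losses are linear, the best fixed comparator is a vertex of the simplex, which is why the regret is measured against the best single coordinate.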

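Lemma 11. Let $1 < p \le 2$ and $C > 0$ be fixed constants. The following statements are equivalent.

1. For all sequences of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$ and any $x_0 \in \mathcal{B}^\star$:

$$\sup_n \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] \le C^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{n \ge 1} \mathbb{E}\left[ \|x_n(\epsilon)\|_{\mathcal{X}}^p \right] \right)$$

2. There exists a non-negative convex function $\Psi$ on $\mathcal{B}$ with $\Psi(0) = 0$, that is $q$-uniformly convex w.r.t. norm $\|\cdot\|_{\mathcal{X}^\star}$ and for any $w \in \mathcal{B}$, $\frac{1}{q} \|w\|_{\mathcal{X}^\star}^q \le \Psi(w) \le \frac{C^q}{q} \|w\|_{\mathcal{W}}^q$.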

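Proof. For any $x \in \mathcal{B}^\star$ define $\Psi^\star : \mathcal{B}^\star \to \mathbb{R}$ as

$$\Psi^\star(x) := \sup \left\{ \frac{1}{C^p} \sup_n \mathbb{E}\left[ \left\| x + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right\}$$

where the supremum is over sequences of mappings $(x_n)_{n \ge 1}$ where each $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$ and the sequence is
such that $\sup_n \mathbb{E}\left[ \left\| x + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] < \infty$. Since the supremum of convex functions is a convex function, it is easily
verified that $\Psi^\star(\cdot)$ is convex. Note that by the definition of M-type in Equation 8, we have that for any $x_0 \in \mathcal{B}^\star$,
$\Psi^\star(x_0) \le \|x_0\|_{\mathcal{X}}^p$. On the other hand, note that by considering the sequence of constant mappings, $x_i = 0$ for all
$i \ge 1$, we get that for any $x_0 \in \mathcal{B}^\star$,

$$\Psi^\star(x_0) = \sup \left\{ \frac{1}{C^p} \sup_n \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right\} \ge \frac{1}{C^p} \|x_0\|_{\mathcal{W}^\star}^p$$

Thus we can conclude that for any $x \in \mathcal{B}^\star$, $\frac{1}{C^p} \|x\|_{\mathcal{W}^\star}^p \le \Psi^\star(x) \le \|x\|_{\mathcal{X}}^p$.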

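For any $x_0, y_0 \in \mathcal{B}^\star$, by definition of $\Psi^\star(x_0)$ and $\Psi^\star(y_0)$, for any $\gamma > 0$, there exist sequences $(x_n)_{n \ge 1}$ and $(y_n)_{n \ge 1}$
s.t.:

$$\Psi^\star(x_0) \le \frac{1}{C^p} \sup_n \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] + \gamma$$

and

$$\Psi^\star(y_0) \le \frac{1}{C^p} \sup_n \mathbb{E}\left[ \left\| y_0 + \sum_{i=1}^n \epsilon_i y_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|y_i(\epsilon)\|_{\mathcal{X}}^p \right] + \gamma$$

In fact, in the above two inequalities, if the supremum over n were achieved at some finite $n_0$, then by replacing the original
sequence by one which is identical up to $n_0$ and uses $x_i(\epsilon) = 0$ for any $i > n_0$ (and similarly $y_i(\epsilon) = 0$), we can in
fact conclude that, using these x's and y's instead,

$$\Psi^\star(x_0) \le \frac{1}{C^p} \mathbb{E}\left[ \left\| x_0 + \sum_{i \ge 1} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] + \gamma \qquad (15)$$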
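and

$$\Psi^\star(y_0) \le \frac{1}{C^p} \mathbb{E}\left[ \left\| y_0 + \sum_{i \ge 1} \epsilon_i y_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|y_i(\epsilon)\|_{\mathcal{X}}^p \right] + \gamma \qquad (16)$$

Now consider a sequence formed by taking $z_0 = \frac{x_0 + y_0}{2}$ and further let

$$z_1 = \left( \frac{1 + \epsilon_0}{2} \right) \frac{x_0 - y_0}{2} + \left( \frac{1 - \epsilon_0}{2} \right) \frac{y_0 - x_0}{2} = \epsilon_0 \, \frac{x_0 - y_0}{2}$$

and for any $i \ge 2$, define

$$z_i = \left( \frac{1 + \epsilon_0}{2} \right) x_{i-1}(\epsilon) + \left( \frac{1 - \epsilon_0}{2} \right) y_{i-1}(\epsilon)$$

where $\epsilon_0 \in \{\pm 1\}$ is drawn uniformly at random. That is, essentially at time i = 1 we flip a coin and decide to go with
the sequence $(x_n)_{n \ge 0}$ with probability 1/2 and $(y_n)_{n \ge 0}$ with probability 1/2. Clearly using the sequence $(z_n)_{n \ge 1}$,
we have that,

$$
\begin{aligned}
\Psi^\star\left( \frac{x_0 + y_0}{2} \right)
&\ge \frac{1}{C^p} \mathbb{E}\left[ \left\| z_0 + \sum_{i \ge 1} \epsilon_i z_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|z_i(\epsilon)\|_{\mathcal{X}}^p \right] \\
&= \frac{1}{2} \left( \frac{1}{C^p} \mathbb{E}\left[ \left\| x_0 + \sum_{i \ge 1} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \\
&\quad + \frac{1}{2} \left( \frac{1}{C^p} \mathbb{E}\left[ \left\| y_0 + \sum_{i \ge 1} \epsilon_i y_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] - \sum_{i \ge 1} \mathbb{E}\left[ \|y_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) - \left\| \frac{x_0 - y_0}{2} \right\|_{\mathcal{X}}^p \\
&\ge \frac{\Psi^\star(x_0) + \Psi^\star(y_0)}{2} - \left\| \frac{x_0 - y_0}{2} \right\|_{\mathcal{X}}^p - \gamma
\end{aligned}
$$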
where the last step is obtained by using Equations 15 and 16. Since $\gamma$ was arbitrary, taking the limit we conclude that for
any $x_0$ and $y_0$,

$$\frac{\Psi^\star(x_0) + \Psi^\star(y_0)}{2} \le \Psi^\star\left( \frac{x_0 + y_0}{2} \right) + \left\| \frac{x_0 - y_0}{2} \right\|_{\mathcal{X}}^p$$

Hence we have shown the existence of a convex function $\Psi^\star$ that is $p$-uniformly smooth w.r.t. norm $\|\cdot\|_{\mathcal{X}}$ such that
$\frac{1}{C^p} \|\cdot\|_{\mathcal{W}^\star}^p \le \Psi^\star(\cdot) \le \|\cdot\|_{\mathcal{X}}^p$. Using convex duality we can conclude that the convex conjugate $\Psi$ of the function $\Psi^\star$ is
$q$-uniformly convex w.r.t. norm $\|\cdot\|_{\mathcal{X}^\star}$ and is such that $\frac{1}{q} \|\cdot\|_{\mathcal{X}^\star}^q \le \Psi(\cdot) \le \frac{C^q}{q} \|\cdot\|_{\mathcal{W}}^q$. That 2 implies 1 can be easily
verified using the smoothness property of $\Psi^\star$.

□

The following sequence of four lemmas gives us the essentials for proving Lemma 5. They use techniques similar to
those in [16].

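Lemma 12. Let $1 < r \le 2$. If there exists a constant $D > 0$ such that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings
$(x_n)_{n \ge 1}$, where $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$, satisfy:

$$\forall n \in \mathbb{N}, \quad \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D (n+1)^{1/r} \sup_{0 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}$$

then for all $p < r$ and $\alpha_p = \frac{12 D}{r - p}$ we can conclude that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings $(x_n)_{n \ge 1}$, where
$x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$, will satisfy:

$$\sup_n \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le \alpha_p \sup_\epsilon \left( \sum_{i \ge 0} \|x_i(\epsilon)\|_{\mathcal{X}}^p \right)^{1/p}$$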



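Proof. To begin with, note that in the definition of type, if the supremum over n were achieved at some finite $n_0$, then
replacing the original sequence by one which is identical up to $n_0$ and uses $x_i(\epsilon) = 0$ for any $i > n_0$
would only tighten the inequality. Hence it suffices to only consider such sequences. Further, to prove the statement
we only need to consider finite such sequences (i.e. sequences such that there exists some n so that for any $i > n$,
$x_i = 0$) and show that the inequality holds for every such n (every such sequence).

Restricting ourselves to such finite sequences, we now use the shorthand
$S = \sup_\epsilon \left( \sum_{i=0}^n \|x_i(\epsilon)\|_{\mathcal{X}}^p \right)^{1/p}$. Now, for each $k \ge 0$, define the index set

$$I_k(\epsilon) = \left\{ i \ge 0 : \frac{S}{2^{(k+1)/p}} < \|x_i(\epsilon)\|_{\mathcal{X}} \le \frac{S}{2^{k/p}} \right\}$$

together with

$$T_0^{(k)}(\epsilon) = \inf\{ i \in I_k(\epsilon) \} \quad \text{and} \quad \forall m \in \mathbb{N}, \; T_m^{(k)}(\epsilon) = \inf\{ i > T_{m-1}^{(k)}(\epsilon), \; i \in I_k(\epsilon) \}$$

Note that for any $\epsilon \in \{\pm 1\}^{\mathbb{N}}$,

$$S^p \ge \sum_{i \in I_k(\epsilon)} \|x_i(\epsilon)\|_{\mathcal{X}}^p > \frac{S^p \, |I_k(\epsilon)|}{2^{k+1}}$$

and so we get that $\sup_\epsilon |I_k(\epsilon)| < 2^{k+1}$. From this we conclude that (with the convention $\epsilon_0 = 1$)

$$
\begin{aligned}
\mathbb{E}\left[ \left\| x_0 + \sum_{i \ge 1} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right]
&\le \sum_{k \ge 0} \mathbb{E}\left[ \left\| \sum_{i \in I_k(\epsilon)} \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \\
&= \sum_{k \ge 0} \mathbb{E}\left[ \left\| \sum_{m \ge 0} \epsilon_{T_m^{(k)}(\epsilon)} \, x_{T_m^{(k)}(\epsilon)}(\epsilon) \right\|_{\mathcal{W}^\star} \right] \\
&\le \sum_{k \ge 0} \left( D \sup_\epsilon \left\{ |I_k(\epsilon)|^{1/r} \right\} \sup_\epsilon \left\{ \sup_{i \in I_k(\epsilon)} \|x_i(\epsilon)\|_{\mathcal{X}} \right\} \right) \\
&\le \sum_{k \ge 0} \left( D \, 2^{(k+1)/r} \sup_\epsilon \sup_{i \in I_k(\epsilon)} \|x_i(\epsilon)\|_{\mathcal{X}} \right) \\
&\le \sum_{k \ge 0} D \, 2^{(k+1)/r} \, 2^{-k/p} \, S \\
&\le 2 D S \sum_{k \ge 0} 2^{k \left( \frac{1}{r} - \frac{1}{p} \right)} \\
&= \frac{2D}{1 - 2^{\left( \frac{1}{r} - \frac{1}{p} \right)}} \, S \\
&\le \frac{2D}{1 - 2^{-(r-p)/4}} \, S \\
&\le \frac{12 D}{r - p} \, S = \alpha_p \sup_\epsilon \left( \sum_{i \ge 0} \|x_i(\epsilon)\|_{\mathcal{X}}^p \right)^{1/p}
\end{aligned}
$$

□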
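
As a quick sanity check on the constant, the following sketch numerically compares the value of the geometric sum, 2 / (1 - 2^{1/r - 1/p}), with the final constant 12 / (r - p) on a grid of exponents 1 < p < r ≤ 2 (taking D = 1). It checks only the end-to-end inequality between these two expressions, not the intermediate step; the grid resolution and all names are arbitrary choices of ours.

```python
import numpy as np

def geometric_constant(p, r):
    """Value of 2 * sum_{k >= 0} 2^{k (1/r - 1/p)} for p < r."""
    return 2.0 / (1.0 - 2.0 ** (1.0 / r - 1.0 / p))

grid = np.linspace(1.01, 2.0, 100)

# Ratio of the geometric-sum constant to 12 / (r - p); values below 1 mean
# that 12 D / (r - p) dominates on this grid point.
worst_ratio = max(
    geometric_constant(p, r) * (r - p) / 12.0
    for r in grid
    for p in grid
    if p < r
)
print(f"largest ratio on the grid: {worst_ratio:.4f}")
```

On this grid the ratio stays below 1 and is largest when p and r are both close to 2.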

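Lemma 13. Let $1 < r \le 2$. If there exists a constant $D > 0$ such that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings
$(x_n)_{n \ge 1}$, where $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$, satisfy:

$$\forall n \in \mathbb{N}, \quad \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D (n+1)^{1/r} \sup_{0 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}$$

then for any $p < r$, any $c > 0$, any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings $(x_n)_{n \ge 1}$, where $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$:

$$\mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right) \le 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)}$$

Proof. For any $x_0 \in \mathcal{B}^\star$ and sequence $(x_n)_{n \ge 1}$ define

$$V_n(\epsilon) = \sum_{i=0}^n \|x_i(\epsilon)\|_{\mathcal{X}}^p$$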



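For an appropriate choice of $a > 0$ to be fixed later, define the stopping time

$$\tau(\epsilon) = \inf\{ n \ge 0 \mid V_{n+1} > a^p \}$$

Now for any $c > 0$ we have,

$$
\begin{aligned}
\mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right)
&\le \mathbb{P}(\tau(\epsilon) < \infty) + \mathbb{P}\left( \tau(\epsilon) = \infty, \; \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right) \\
&\le \mathbb{P}(\tau(\epsilon) < \infty) + \mathbb{P}\left( \sup_n \left\| \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right\|_{\mathcal{W}^\star} > c \right) \qquad (17)
\end{aligned}
$$

As for the first term in the above equation note that

$$\mathbb{P}(\tau(\epsilon) < \infty) = \mathbb{P}\left( \sup_n V_n(\epsilon) > a^p \right) \le \frac{1}{a^p} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \qquad (18)$$

To consider the second term of Equation 17 we note that $\left( \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right)_{n \ge 0}$ is a valid martingale
(stopped process) and hence its norm $\left( \left\| \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right\|_{\mathcal{W}^\star} \right)_{n \ge 0}$ is a sub-martingale. Hence by Doob's
inequality we conclude that,

$$\mathbb{P}\left( \sup_n \left\| \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right\|_{\mathcal{W}^\star} > c \right) \le \frac{1}{c} \sup_n \mathbb{E}\left[ \left\| \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right\|_{\mathcal{W}^\star} \right]$$

Applying the conclusion of the previous lemma we get that

$$\mathbb{P}\left( \sup_n \left\| \mathbb{1}_{\{\tau(\epsilon) > 0\}} \left( x_0 + \sum_{i=1}^{n \wedge \tau(\epsilon)} \epsilon_i x_i(\epsilon) \right) \right\|_{\mathcal{W}^\star} > c \right) \le \frac{\alpha_p}{c} \sup_\epsilon \left( \sum_{i=0}^{\tau(\epsilon)} \|x_i(\epsilon)\|_{\mathcal{X}}^p \right)^{1/p} \le \frac{\alpha_p \, a}{c}$$

Plugging the above and Equation 18 into Equation 17 we conclude that:

$$\mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right) \le \frac{1}{a^p} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) + \frac{\alpha_p \, a}{c}$$

Using $a = \left( \frac{c}{\alpha_p} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \right)^{1/(p+1)}$ we conclude that

$$\mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right) \le 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)}$$

This concludes the proof.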
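
The choice of a can be checked with a few lines of arithmetic. The sketch below assumes, as our reading of the argument, that the bound before fixing a has the two-term form A / a^p + α_p a / c, where A stands for ||x_0||^p + Σ_i E||x_i(ε)||^p, and verifies that the stated a makes the two terms equal and yields 2 (α_p / c)^{p/(p+1)} A^{1/(p+1)}. The numeric values and function names are arbitrary and purely illustrative.

```python
def two_term_bound(a, c, alpha, A, p):
    """Assumed form of the bound before a is chosen: A / a^p + alpha * a / c."""
    return A / a**p + alpha * a / c

def balanced_a(c, alpha, A, p):
    """The choice a = ((c / alpha) * A)^{1/(p+1)} used in the proof."""
    return ((c / alpha) * A) ** (1.0 / (p + 1))

def closed_form(c, alpha, A, p):
    """Right-hand side of the lemma: 2 (alpha / c)^{p/(p+1)} A^{1/(p+1)}."""
    return 2.0 * (alpha / c) ** (p / (p + 1)) * A ** (1.0 / (p + 1))

# Arbitrary positive test values (hypothetical, for illustration only).
c, alpha, A, p = 3.0, 0.7, 1.3, 1.5
a_star = balanced_a(c, alpha, A, p)
first_term = A / a_star**p
second_term = alpha * a_star / c
lhs = two_term_bound(a_star, c, alpha, A, p)
rhs = closed_form(c, alpha, A, p)
```

With this a the two terms coincide, which is why the factor 2 appears in the final bound.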
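Lemma 14. Let $1 < r \le 2$. If there exists a constant $D > 0$ such that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings
$(x_n)_{n \ge 1}$, where $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$, satisfy:

$$\forall n \in \mathbb{N}, \quad \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D (n+1)^{1/r} \sup_{0 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}$$

then for any $p < r$, any $x_0 \in \mathcal{B}^\star$ and any sequence $(x_n)_{n \ge 1}$ satisfies

$$\sup_{\lambda > 0} \lambda^p \, \mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > \lambda \right) \le \max\left\{ 4^{p+1} \alpha_p^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right), \; 2^{2p+3} \log(2) \, \alpha_p^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \right\}$$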



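Proof. We shall use Proposition 8.53 of Pisier's notes, which is restated below, to prove this lemma. To this end consider
any $x_0 \in \mathcal{B}^\star$ and any sequence $(x_i)_{i \ge 1}$. Given an $\epsilon \in \{\pm 1\}^{\mathbb{N}}$, for any $j \in [M]$ and $i \in \mathbb{N}$ let $\epsilon_i^{(j)} = \epsilon_{(i-1)M + j}$. Let
$z_0 = x_0 M^{-1/p}$ and define the sequence $(z_k)_{k \ge 1}$ as follows: for any $k \in \mathbb{N}$ given by $k = j + (i-1)M$ where $j \in [M]$
and $i \in \mathbb{N}$,

$$z_k(\epsilon) = x_i(\epsilon^{(j)}) \, M^{-1/p}$$

Clearly,

$$\sum_{k \ge 1} \mathbb{E}\left[ \|z_k(\epsilon)\|_{\mathcal{X}}^p \right] = \frac{1}{M} \sum_{i \ge 1} \sum_{j=1}^M \mathbb{E}\left[ \|x_i(\epsilon^{(j)})\|_{\mathcal{X}}^p \right] = \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right]$$

By the previous lemma we get that for any $c > 0$,

$$
\begin{aligned}
\mathbb{P}\left( \sup_n \left\| z_0 + \sum_{i=1}^n \epsilon_i z_i(\epsilon) \right\|_{\mathcal{W}^\star} > c \right)
&\le 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|z_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|z_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)} \\
&\le 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)}
\end{aligned}
$$

Note that

$$\sup_n \left\| z_0 + \sum_{i=1}^n \epsilon_i z_i(\epsilon) \right\|_{\mathcal{W}^\star} \ge \sup_{j \in [M]} M^{-1/p} \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i^{(j)} x_i(\epsilon^{(j)}) \right\|_{\mathcal{W}^\star}$$

Hence we conclude that

$$\mathbb{P}\left( M^{-1/p} \sup_{j \in [M]} \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i^{(j)} x_i(\epsilon^{(j)}) \right\|_{\mathcal{W}^\star} > c \right) \le 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)}$$

For any $j \in [M]$, defining $Z^{(j)} = \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i^{(j)} x_i(\epsilon^{(j)}) \right\|_{\mathcal{W}^\star}$ and using Proposition 16 we conclude that for
any $c > 0$,

$$\sup_{\lambda > 0} \lambda^p \, \mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > \lambda \right) \le \max\left\{ c^p, \; 2 c^p \log\left( \frac{1}{1 - 2 \left( \frac{\alpha_p}{c} \right)^{p/(p+1)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/(p+1)}} \right) \right\}$$

Picking

$$c = 4^{(p+1)/p} \, \alpha_p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)^{1/p}$$

we conclude that

$$\sup_{\lambda > 0} \lambda^p \, \mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > \lambda \right) \le \max\left\{ 4^{p+1} \alpha_p^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right), \; 2^{2p+3} \log(2) \, \alpha_p^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \right\}$$

□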
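Lemma 15. Let $1 < r \le 2$. If there exists a constant $D > 0$ such that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings
$(x_n)_{n \ge 1}$, where $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$, satisfy:

$$\forall n \in \mathbb{N}, \quad \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D (n+1)^{1/r} \sup_{0 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}$$

then for all $p < r$, we can conclude that any $x_0 \in \mathcal{B}^\star$ and any sequence of mappings $(x_n)_{n \ge 1}$ where each
$x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$ will satisfy:

$$\sup_n \mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] \le \left( \frac{1104 \, D}{(r - p)^2} \right)^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)$$

That is, the pair $(\mathcal{W}, \mathcal{X})$ is of martingale type p.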

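Proof. Given any $p < r$ pick $r > p' > p$. Due to the homogeneity of the statement we need to prove, w.l.o.g. we can
assume that

$$\|x_0\|_{\mathcal{X}}^{p'} + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^{p'} \right] = 1$$

Hence by the previous lemma, we can conclude that

$$\sup_{\lambda > 0} \lambda^{p'} \, \mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > \lambda \right) \le 2^{2p'+3} \log(2) \, \alpha_{p'}^{p'} \le (32 \, \alpha_{p'})^{p'}$$

Hence,

$$
\begin{aligned}
\mathbb{E}\left[ \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right]
&\le \inf_{a > 0} \left\{ a^p + p \int_a^\infty \lambda^{p-1} \, \mathbb{P}\left( \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} > \lambda \right) d\lambda \right\} \\
&\le \inf_{a > 0} \left\{ a^p + p \, (32 \, \alpha_{p'})^{p'} \int_a^\infty \lambda^{p-1-p'} \, d\lambda \right\} \\
&\le \inf_{a > 0} \left\{ a^p + p \, (32 \, \alpha_{p'})^{p'} \left[ \frac{\lambda^{p-p'}}{p - p'} \right]_a^\infty \right\} \qquad (19) \\
&\le \inf_{a > 0} \left\{ a^p + (46 \, \alpha_{p'})^{p'} \, \frac{a^{p-p'}}{p' - p} \right\}
\end{aligned}
$$

Since $\|x_0\|_{\mathcal{X}}^{p'} + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^{p'} \right] = 1$ and $p' > p$, we can conclude that $\|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \ge 1$ and so

$$
\begin{aligned}
\mathbb{E}\left[ \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right]
&\le 2 \, \frac{(46 \, \alpha_{p'})^p}{(p' - p)^{p/p'}} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \\
&\le 2 \, \frac{(46 \, \alpha_{p'})^p}{(p' - p)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)
\end{aligned}
$$

Since $p'$ can be chosen arbitrarily close to $r$, taking the limit we can conclude that

$$\mathbb{E}\left[ \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right] \le 2 \, \frac{(46 \, \alpha_p)^p}{(r - p)} \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)$$

Recalling that $\alpha_p = \frac{12 D}{r - p}$ we conclude that

$$
\begin{aligned}
\mathbb{E}\left[ \sup_n \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star}^p \right]
&\le \left( \frac{1104 \, D}{(r - p)^{(p+1)/p}} \right)^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right) \\
&\le \left( \frac{1104 \, D}{(r - p)^2} \right)^p \left( \|x_0\|_{\mathcal{X}}^p + \sum_{i \ge 1} \mathbb{E}\left[ \|x_i(\epsilon)\|_{\mathcal{X}}^p \right] \right)
\end{aligned}
$$

This concludes the proof.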
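
The last two steps are pure arithmetic on constants, and a short numeric check may help the reader. The sketch below takes D = 1 (the factor D^p is common to both sides), substitutes α_p = 12 D / (r - p) into 2 (46 α_p)^p / (r - p), and compares the result with (1104 D / (r - p)^2)^p on a grid of exponents 1 < p < r ≤ 2. The grid and the function names are illustrative choices of ours.

```python
import numpy as np

def constant_before(p, r, D=1.0):
    """2 * (46 * alpha_p)^p / (r - p) with alpha_p = 12 D / (r - p)."""
    alpha_p = 12.0 * D / (r - p)
    return 2.0 * (46.0 * alpha_p) ** p / (r - p)

def constant_after(p, r, D=1.0):
    """The final constant (1104 D / (r - p)^2)^p."""
    return (1104.0 * D / (r - p) ** 2) ** p

grid = np.linspace(1.01, 2.0, 100)
pairs = [(p, r) for r in grid for p in grid if p < r]

# The final constant should dominate at every grid point, since
# 2 * 552^p <= 1104^p for p >= 1 and (r - p)^(p + 1) >= (r - p)^(2p) for r - p <= 1.
all_dominated = all(constant_before(p, r) <= constant_after(p, r) for p, r in pairs)
```

The two facts in the comment are exactly what turn 46 · 12 = 552 into the constant 1104 and the exponent (p+1)/p into 2.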

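We restate below a proposition from Pisier's notes [17].

Proposition 16 (Proposition 8.53 of [17]). Consider a random variable $Z \ge 0$ and a sequence $Z^{(1)}, Z^{(2)}, \ldots$ drawn
i.i.d. from the distribution of $Z$. For some $0 < p < \infty$, $0 < \delta < 1$ and $R > 0$,

$$\sup_{M \ge 1} \mathbb{P}\left( \sup_{m \le M} M^{-1/p} Z^{(m)} > R \right) \le \delta \;\Longrightarrow\; \sup_{\lambda > 0} \lambda^p \, \mathbb{P}(Z > \lambda) \le \max\left\{ R^p, \; 2 R^p \log\left( \frac{1}{1 - \delta} \right) \right\}$$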

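Proof of Lemma 5. By Theorem 4 and our assumption that $\mathcal{V}_n(\mathcal{W}, \mathcal{X}) \le D n^{-(1 - 1/r)}$, we have that for any sequence
$(x_n)_{n \ge 1}$ such that $x_n : \{\pm 1\}^{n-1} \to \mathcal{X}$ and any $n \ge 1$,

$$\mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D n^{-(1 - 1/r)}$$

Hence we can conclude for any sequence $(x_n)_{n \ge 1}$ such that $x_n : \{\pm 1\}^{n-1} \to \mathcal{B}^\star$ and any $n \ge 1$,

$$\mathbb{E}\left[ \left\| \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] \le D n^{1/r} \sup_{1 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}$$

Hence for any $x_0 \in \mathcal{B}^\star$, we have that

$$
\begin{aligned}
\mathbb{E}\left[ \left\| x_0 + \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right]
&\le \mathbb{E}\left[ \left\| \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] + \|x_0\|_{\mathcal{W}^\star} \\
&\le \mathbb{E}\left[ \left\| \sum_{i=1}^n \epsilon_i x_i(\epsilon) \right\|_{\mathcal{W}^\star} \right] + D \|x_0\|_{\mathcal{X}} \\
&\le D n^{1/r} \sup_{1 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}} + D \|x_0\|_{\mathcal{X}} \\
&\le 2 D (n+1)^{1/r} \sup_{0 \le i \le n} \sup_\epsilon \|x_i(\epsilon)\|_{\mathcal{X}}
\end{aligned}
$$

Now applying Lemma 15 completes the proof.

□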